Machine Translation Experiments on PADIC: A Parallel Arabic DIalect Corpus
Internal identifier: 000261 (Main/Exploration); previous: 000260; next: 000262
Authors: Karima Meftouh [Algeria]; Salima Harrat [Algeria]; Salma Jamoussi [Tunisia]; Mourad Abbas [Algeria]; Kamel Smaili
Source:
Abstract
In this paper we present PADIC, a Parallel Arabic DIalect Corpus that we built from scratch, and report experiments on cross-dialect Arabic machine translation. PADIC covers dialects from both the Maghreb and the Middle East, each aligned with Modern Standard Arabic (MSA). Five dialects are covered by this study: three from the Maghreb (two from Algeria and one from Tunisia) and two from the Middle East (Syria and Palestine). PADIC was built from scratch because of the lack of dialect resources: Arabic dialects across the Arab world are used in everyday conversation but are generally not written. To the best of our knowledge, PADIC is currently the largest corpus available to the community working on dialects, especially Maghrebi ones. It contains 6400 sentences for each of the 5 dialects and for MSA. We conducted cross-lingual machine translation experiments between all the language pairs. For translation into MSA, we interpolated the corresponding Language Model (LM) with an LM trained on a large Arabic corpus. We also studied the impact of language model smoothing techniques on machine translation results because this corpus, even though it is the largest of its kind, is still very small compared with those used for translating natural languages.
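The abstract mentions interpolating a small in-domain LM with a large-corpus LM and studying smoothing, but it does not state the interpolation method, the weights, or the toolkit. The following is only a minimal sketch assuming linear interpolation of two add-k smoothed unigram models; the function names, the toy placeholder tokens, and the weight `lam=0.5` are all hypothetical and are not taken from the paper.

```python
from collections import Counter
import math


def train_unigram(tokens, vocab, k=1.0):
    """Add-k smoothed unigram model over a fixed, shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + k * len(vocab)
    return {w: (counts[w] + k) / total for w in vocab}


def interpolate(p_small, p_large, lam):
    """Linear interpolation: lam * p_small(w) + (1 - lam) * p_large(w)."""
    return {w: lam * p_small[w] + (1.0 - lam) * p_large[w] for w in p_small}


def perplexity(model, tokens):
    """Perplexity of a unigram model on a token sequence."""
    log_sum = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_sum / len(tokens))


# Toy placeholder tokens (assumption: stand-ins for real corpus words).
small_corpus = "a b a c".split()
large_corpus = "a b b c c c d d d d".split()
vocab = sorted(set(small_corpus) | set(large_corpus))

p_small = train_unigram(small_corpus, vocab)
p_large = train_unigram(large_corpus, vocab)
p_mix = interpolate(p_small, p_large, lam=0.5)  # hypothetical weight
```

In practice the weight would typically be tuned on held-out data, for example by choosing the `lam` that minimizes `perplexity(p_mix, dev_tokens)`; whether the authors did so is not stated in the abstract.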
Url:
Affiliations:
Links to previous steps (curation, corpus...)
- to stream Hal, to step Corpus: 003030
- to stream Hal, to step Curation: 003030
- to stream Hal, to step Checkpoint: 000235
- to stream Main, to step Merge: 000261
- to stream Main, to step Curation: 000261
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Machine Translation Experiments on PADIC: A Parallel Arabic DIalect Corpus</title>
<author><name sortKey="Meftouh, Karima" sort="Meftouh, Karima" uniqKey="Meftouh K" first="Karima" last="Meftouh">Karima Meftouh</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-21502" status="VALID"><orgName>Laboratoire de Recherche en Informatique</orgName>
<orgName type="acronym">LRI-ANNABA</orgName>
<desc><address><country key="DZ"></country>
</address>
</desc>
<listRelation><relation active="#struct-300650" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-300650" type="direct"><org type="institution" xml:id="struct-300650" status="VALID"><orgName>Université Badji Mokhtar [Annaba]</orgName>
<desc><address><addrLine>BP 12, 23000, Annaba</addrLine>
<country key="DZ"></country>
</address>
<ref type="url">http://www.univ-annaba.dz/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
<author><name sortKey="Harrat, Salima" sort="Harrat, Salima" uniqKey="Harrat S" first="Salima" last="Harrat">Salima Harrat</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-101665" status="INCOMING"><orgName>Ecole Nationale Supérieure d'Informatique (ESI ex-INI)</orgName>
<desc><address><country key="DZ"></country>
</address>
</desc>
<listRelation><relation active="#struct-324511" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-324511" type="direct"><org type="institution" xml:id="struct-324511" status="INCOMING"><orgName>Ecole Nationale Supérieure d'Informatique (ESI ex-INI)</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
<author><name sortKey="Jamoussi, Salma" sort="Jamoussi, Salma" uniqKey="Jamoussi S" first="Salma" last="Jamoussi">Salma Jamoussi</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-206882" status="VALID"><orgName>Multimedia, InfoRmation systems and Advanced Computing Laboratory</orgName>
<orgName type="acronym">MIRACL</orgName>
<desc><address><addrLine>Route de Tunis, km 10, BP 242, Sakiet Ezziet, 3021 SFAX</addrLine>
<country key="TN"></country>
</address>
</desc>
<listRelation><relation active="#struct-350672" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-350672" type="direct"><org type="institution" xml:id="struct-350672" status="INCOMING"><orgName>FSEG-Sfax, ISIM-Sfax</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Tunisie</country>
</affiliation>
</author>
<author><name sortKey="Abbas, Mourad" sort="Abbas, Mourad" uniqKey="Abbas M" first="Mourad" last="Abbas">Mourad Abbas</name>
<affiliation wicri:level="1"><hal:affiliation type="institution" xml:id="struct-267396" status="VALID"><orgName>Centre de Recherche Scientifique et Technique pour le Dévelopement de la Langue Arabe</orgName>
<orgName type="acronym">CRSTDLA</orgName>
<desc><address><addrLine>1,Rue Djamel Eddine EL-Afghani B.P :225. Rostomia-Bouzareah Alger - 16011</addrLine>
<country key="DZ"></country>
</address>
<ref type="url">http://www.crstdla.edu.dz/fr/</ref>
</desc>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
<author><name sortKey="Smaili, Kamel" sort="Smaili, Kamel" uniqKey="Smaili K" first="Kamel" last="Smaili">Kamel Smaili</name>
<affiliation><hal:affiliation type="laboratory" xml:id="struct-446632" status="INCOMING"><orgName>LORIA - UMR 7503,Campus Scientifique - BP 239, 54506 Vandoeuvre-les-Nancy Cedex, France</orgName>
<listRelation><relation active="#struct-446629" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-446629" type="direct"><org type="institution" xml:id="struct-446629" status="INCOMING"><orgName>Laboratoire LORIA</orgName>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:hal-01261587</idno>
<idno type="halId">hal-01261587</idno>
<idno type="halUri">https://hal.archives-ouvertes.fr/hal-01261587</idno>
<idno type="url">https://hal.archives-ouvertes.fr/hal-01261587</idno>
<date when="2015-10-30">2015-10-30</date>
<idno type="wicri:Area/Hal/Corpus">003030</idno>
<idno type="wicri:Area/Hal/Curation">003030</idno>
<idno type="wicri:Area/Hal/Checkpoint">000235</idno>
<idno type="wicri:explorRef" wicri:stream="Hal" wicri:step="Checkpoint">000235</idno>
<idno type="wicri:Area/Main/Merge">000261</idno>
<idno type="wicri:Area/Main/Curation">000261</idno>
<idno type="wicri:Area/Main/Exploration">000261</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en">Machine Translation Experiments on PADIC: A Parallel Arabic DIalect Corpus</title>
<author><name sortKey="Meftouh, Karima" sort="Meftouh, Karima" uniqKey="Meftouh K" first="Karima" last="Meftouh">Karima Meftouh</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-21502" status="VALID"><orgName>Laboratoire de Recherche en Informatique</orgName>
<orgName type="acronym">LRI-ANNABA</orgName>
<desc><address><country key="DZ"></country>
</address>
</desc>
<listRelation><relation active="#struct-300650" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-300650" type="direct"><org type="institution" xml:id="struct-300650" status="VALID"><orgName>Université Badji Mokhtar [Annaba]</orgName>
<desc><address><addrLine>BP 12, 23000, Annaba</addrLine>
<country key="DZ"></country>
</address>
<ref type="url">http://www.univ-annaba.dz/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
<author><name sortKey="Harrat, Salima" sort="Harrat, Salima" uniqKey="Harrat S" first="Salima" last="Harrat">Salima Harrat</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-101665" status="INCOMING"><orgName>Ecole Nationale Supérieure d'Informatique (ESI ex-INI)</orgName>
<desc><address><country key="DZ"></country>
</address>
</desc>
<listRelation><relation active="#struct-324511" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-324511" type="direct"><org type="institution" xml:id="struct-324511" status="INCOMING"><orgName>Ecole Nationale Supérieure d'Informatique (ESI ex-INI)</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
<author><name sortKey="Jamoussi, Salma" sort="Jamoussi, Salma" uniqKey="Jamoussi S" first="Salma" last="Jamoussi">Salma Jamoussi</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-206882" status="VALID"><orgName>Multimedia, InfoRmation systems and Advanced Computing Laboratory</orgName>
<orgName type="acronym">MIRACL</orgName>
<desc><address><addrLine>Route de Tunis, km 10, BP 242, Sakiet Ezziet, 3021 SFAX</addrLine>
<country key="TN"></country>
</address>
</desc>
<listRelation><relation active="#struct-350672" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-350672" type="direct"><org type="institution" xml:id="struct-350672" status="INCOMING"><orgName>FSEG-Sfax, ISIM-Sfax</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Tunisie</country>
</affiliation>
</author>
<author><name sortKey="Abbas, Mourad" sort="Abbas, Mourad" uniqKey="Abbas M" first="Mourad" last="Abbas">Mourad Abbas</name>
<affiliation wicri:level="1"><hal:affiliation type="institution" xml:id="struct-267396" status="VALID"><orgName>Centre de Recherche Scientifique et Technique pour le Dévelopement de la Langue Arabe</orgName>
<orgName type="acronym">CRSTDLA</orgName>
<desc><address><addrLine>1,Rue Djamel Eddine EL-Afghani B.P :225. Rostomia-Bouzareah Alger - 16011</addrLine>
<country key="DZ"></country>
</address>
<ref type="url">http://www.crstdla.edu.dz/fr/</ref>
</desc>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
<author><name sortKey="Smaili, Kamel" sort="Smaili, Kamel" uniqKey="Smaili K" first="Kamel" last="Smaili">Kamel Smaili</name>
<affiliation><hal:affiliation type="laboratory" xml:id="struct-446632" status="INCOMING"><orgName>LORIA - UMR 7503,Campus Scientifique - BP 239, 54506 Vandoeuvre-les-Nancy Cedex, France</orgName>
<listRelation><relation active="#struct-446629" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-446629" type="direct"><org type="institution" xml:id="struct-446629" status="INCOMING"><orgName>Laboratoire LORIA</orgName>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">In this paper we present PADIC, a Parallel Arabic DIalect Corpus that we built from scratch, and report experiments on cross-dialect Arabic machine translation. PADIC covers dialects from both the Maghreb and the Middle East, each aligned with Modern Standard Arabic (MSA). Five dialects are covered by this study: three from the Maghreb (two from Algeria and one from Tunisia) and two from the Middle East (Syria and Palestine). PADIC was built from scratch because of the lack of dialect resources: Arabic dialects across the Arab world are used in everyday conversation but are generally not written. To the best of our knowledge, PADIC is currently the largest corpus available to the community working on dialects, especially Maghrebi ones. It contains 6400 sentences for each of the 5 dialects and for MSA. We conducted cross-lingual machine translation experiments between all the language pairs. For translation into MSA, we interpolated the corresponding Language Model (LM) with an LM trained on a large Arabic corpus. We also studied the impact of language model smoothing techniques on machine translation results because this corpus, even though it is the largest of its kind, is still very small compared with those used for translating natural languages.</div>
</front>
</TEI>
<affiliations><list><country><li>Algérie</li>
<li>Tunisie</li>
</country>
</list>
<tree><noCountry><name sortKey="Smaili, Kamel" sort="Smaili, Kamel" uniqKey="Smaili K" first="Kamel" last="Smaili">Kamel Smaili</name>
</noCountry>
<country name="Algérie"><noRegion><name sortKey="Meftouh, Karima" sort="Meftouh, Karima" uniqKey="Meftouh K" first="Karima" last="Meftouh">Karima Meftouh</name>
</noRegion>
<name sortKey="Abbas, Mourad" sort="Abbas, Mourad" uniqKey="Abbas M" first="Mourad" last="Abbas">Mourad Abbas</name>
<name sortKey="Harrat, Salima" sort="Harrat, Salima" uniqKey="Harrat S" first="Salima" last="Harrat">Salima Harrat</name>
</country>
<country name="Tunisie"><noRegion><name sortKey="Jamoussi, Salma" sort="Jamoussi, Salma" uniqKey="Jamoussi S" first="Salma" last="Jamoussi">Salma Jamoussi</name>
</noRegion>
</country>
</tree>
</affiliations>
</record>
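As an illustration of how the `<affiliations>` tree at the end of the record groups authors by country, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The fragment is a trimmed copy of that tree (only the `sortKey` attribute is kept), and `authors_by_country` is a hypothetical helper, not part of Dilib. Note that the full record uses the `wicri:` and `hal:` prefixes without declaring them, so a namespace-aware parser would likely reject it unless those declarations are supplied.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <affiliations> tree from the record above.
FRAGMENT = """
<affiliations><tree>
<noCountry><name sortKey="Smaili, Kamel">Kamel Smaili</name></noCountry>
<country name="Algérie"><noRegion><name sortKey="Meftouh, Karima">Karima Meftouh</name></noRegion>
<name sortKey="Abbas, Mourad">Mourad Abbas</name>
<name sortKey="Harrat, Salima">Salima Harrat</name>
</country>
<country name="Tunisie"><noRegion><name sortKey="Jamoussi, Salma">Salma Jamoussi</name></noRegion></country>
</tree></affiliations>
"""


def authors_by_country(xml_text):
    """Map each country name (None for <noCountry>) to its author names."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for group in root.find("tree"):
        # <country> carries a name attribute; <noCountry> does not.
        key = group.get("name") if group.tag == "country" else None
        result[key] = [n.text for n in group.iter("name")]
    return result


grouped = authors_by_country(FRAGMENT)
```

Here `grouped[None]` holds the author with no resolved country, which matches the `<noCountry>` branch of the record.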
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000261 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000261 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Wicri/Lorraine |area= InforLorV4 |flux= Main |étape= Exploration |type= RBID |clé= Hal:hal-01261587 |texte= Machine Translation Experiments on PADIC: A Parallel Arabic DIalect Corpus }}
This area was generated with Dilib version V0.6.33.